Physical functioning in the lumbar spinal surgery population: A systematic review and narrative synthesis of outcome measures and measurement properties of the physical measures

Background. International agreement supports physical functioning as a key domain to measure intervention effectiveness for low back pain. Patient reported outcome measures (PROMs) are commonly used in the lumbar spinal surgery population, but physical functioning is multidimensional and also necessitates evaluation with physical measures.
Objective. 1) To identify outcome measures (PROMs and physical) used to evaluate physical functioning in the lumbar spinal surgery population. 2) To assess measurement properties and describe the feasibility and interpretability of physical measures of physical functioning in this population.
Study design. Two-staged systematic review and narrative synthesis.
Methods. This systematic review was conducted according to a registered and published protocol. Two stages of searching were conducted in MEDLINE, EMBASE, Health & Psychosocial Instruments, CINAHL, Web of Science, PEDro and ProQuest Dissertations & Theses. Stage one included studies to identify physical functioning outcome measures (PROMs and physical) in the lumbar spinal surgery population. Stage two (inception to 10 July 2023) included studies assessing measurement properties of stage one physical measures. Two independent reviewers determined study eligibility, extracted data and assessed risk of bias (RoB) according to COSMIN guidelines. Measurement properties were rated according to COSMIN criteria. Level of evidence was determined using a modified GRADE approach.
Results. Stage one included 1,101 reports using PROMs (n = 70 established in literature, n = 67 developed by study authors) and physical measures (n = 134). Stage two included 43 articles assessing measurement properties of 34 physical measures. Moderate-level evidence supported sufficient responsiveness of 1-minute stair climb and 50-foot walk tests, insufficient responsiveness of 5-minute walk and sufficient reliability of distance walked during the 6-minute walk. Very low/low-level evidence limits further understanding.
Conclusions. Many physical measures of physical functioning are used in lumbar spinal surgery populations. Few have investigations of measurement properties. Strongest evidence supports responsiveness of 1-minute stair climb and 50-foot walk tests and reliability of distance walked during the 6-minute walk. Further recommendations cannot be made because of very low/low-level evidence. Results highlight promise for a range of measures, but prospective, low RoB studies are required.


Introduction
Musculoskeletal low back pain (LBP) persists as a leading global cause of disability from adolescence to old age [1]. It is the most prevalent condition requiring effective rehabilitation [2], with best-evidence guidelines recommending interventions focused on self-management, physical and psychological therapies [3,4]. For appropriate clinical indications, surgical interventions are effective in reducing pain and enhancing physical functioning [3,5]. Selecting appropriate outcome measures for the lumbar spinal surgery population is important, as population-specific outcome measures are recommended when measuring treatment outcomes for specific clinical populations and when focusing on the individual, an important component of providing patient-centered care [6,7].
International agreement supports physical functioning as the most important outcome domain to measure effectiveness of interventions for LBP [8]. Physical functioning is the impact of a condition on physical activities of daily living (e.g., walking ability, performance status, disability index) [9]. Use of patient reported outcome measures (PROMs) to evaluate physical functioning in LBP is common, despite low to very low quality evidence for their content validity [10]. The Oswestry Disability Index (ODI) is the most commonly used/recommended PROM in LBP and lumbar spinal surgery [11,12]. However, previous systematic reviews have highlighted that a breadth of PROMs is used to evaluate physical functioning in LBP and lumbar spinal surgery populations [12,13]. As these systematic reviews were either conducted over 20 years ago [13] or with a limited search strategy [12], a contemporary and comprehensive search of the literature will enable wider consideration of PROMs to evaluate physical functioning in lumbar spinal surgery populations.
Physical functioning is a multidimensional construct and necessitates evaluation with physical outcome measures, including impairments (e.g., strength), performance on a standardized task (e.g., 6-minute walk) and activity in a natural environment (e.g., step count) [14,15]. Physical measures (e.g., time to symptom onset during 6-minute walk) are the measurement unit of interest. Physical measures are also an important component of assessment in LBP, informing clinical reasoning to formulate a diagnosis, prognosis and intervention plan. Despite increasingly widespread use in lumbar spinal surgery [16] and recognized value in other musculoskeletal conditions [17-19], recommendations for physical measures of physical functioning in the lumbar spinal surgery population do not exist.
Selecting outcome measures with adequate measurement properties is key for accurate interpretation of information gained during a clinical assessment and for measuring effectiveness of interventions in research and clinical practice. The COnsensus-based Standards for the selection of health Measurement Instruments (COSMIN) initiative aims to facilitate selecting high-quality outcome measures through a systematic evaluation of validity, responsiveness, reliability and measurement error of PROMs and physical measures [20,21]. Use of outcome measures with adequate measurement properties supports clinicians in their clinical reasoning for accuracy in assessment and diagnosis, monitoring patient progress, evaluating treatment outcomes and making informed decisions to optimize patient outcomes. When selecting outcome measures, COSMIN also recommends considering interpretability and feasibility, as these are important for clinical understanding of outcome measure scores and application within the local clinical or research context [20]. While use of physical measures in lumbar spinal surgery has risen exponentially [16], there is no systematic review evaluating measurement properties of physical measures of physical functioning in the lumbar spinal surgery population. While systematic reviews of PROM measurement properties exist [12,22], there is also no contemporary comprehensive resource outlining all PROMs of physical functioning, beyond the ODI, in this population.

Objectives
1. To identify outcome measures (patient reported and physical) used to evaluate physical functioning in the lumbar spinal surgery population.
2. To assess the measurement properties and describe the interpretability and feasibility of physical measures of physical functioning in the lumbar spinal surgery population.

Design
Using a two-staged approach, this systematic review was conducted according to a registered (PROSPERO CRD42021293880) and published protocol [23]. Stage one identified PROMs (excluding ODI) and physical measures used to evaluate physical functioning in the lumbar spinal surgery population. Results informed the stage two search strategy to identify studies of measurement properties of the physical measures, guided by COSMIN methodology [20,21]. Reporting aligns with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement [24]. Ethical approval was not required for this systematic review.

Eligibility criteria
Population. Adults aged ≥18 years listed for, or with previous, lumbar spinal surgery for musculoskeletal LBP and/or low back-related leg pain.

Intervention. Lumbar spinal surgery at one or more levels, including thoracolumbar and lumbosacral. Surgeries due to trauma, fracture, space occupying mass (e.g., tumor), inflammatory conditions, infection, osteoporosis, congenital scoliosis, cauda equina syndrome and extra-spinal causes of back and/or leg pain were excluded.
Outcome measures. Stage one: Outcome measures evaluating physical functioning categorized as: 1. PROMs: questionnaires, scales or subscales assessing ≥1 aspect of physical functioning. ODI was excluded as it is a well-established PROM of physical functioning in lumbar spinal surgery [11,25].
Study design of included studies. Stage one: All study designs and article types.
Stage two: Studies of measurement properties (validity, responsiveness, reliability, measurement error). Studies were excluded if data were not original (e.g., systematic review), normative only or insufficient (e.g., conference abstract).
For both stages, studies not in English were excluded.

Information sources
Searches were developed in MEDLINE (Ovid) and adapted by a librarian (MG) for EMBASE (Ovid), Health and Psychosocial Instruments (Ovid), CINAHL (EBSCOhost), Web of Science Core Collection, Scopus, PEDro (stage one only), and ProQuest Dissertations and Theses. Electronic databases were searched from inception to December 15, 2021 for stage one and inception to July 10, 2023 for stage two. Reference lists of included studies in stage two were hand-searched independently by two authors (KK, JM) to identify additional potential articles.

Search strategy
Search strategies were developed in collaboration with a librarian (MG; S1 Appendix) and informed by National Institute for Health and Care Excellence (NICE) guidelines for LBP and sciatica in over 16s [28]. Stage two also included physical outcome measures identified in stage one and the COSMIN sensitive search and exclusion filter [29]. An independent librarian peer-reviewed the stage one search using the Peer Review of Electronic Search Strategies (PRESS) checklist [30].

Selection process
Citations were imported into Covidence (Veritas Health Innovation, Australia) and duplicates removed. Title/abstract screening was performed independently in duplicate (KK, JM, AB). Full texts were obtained and reviewed independently in duplicate (KK, JM, AB) for articles meeting eligibility criteria or when eligibility was unclear. Disagreements at each stage were discussed, and a third author (AR) was consulted if consensus was not achieved.

Data collection process and data items
Data were extracted independently in duplicate (KK, JM, AB) using standardized data extraction sheets. Data extraction included study characteristics, participant characteristics and outcome measures. Stage two was guided by the COSMIN scoring form, which also included measurement properties and information related to interpretability and feasibility, as recommended by COSMIN to aid selection of physical measures [20]. Differences in data extraction were resolved through discussion. One investigator [31] was contacted and responded to one email to clarify reporting during stage two, in accordance with the a priori strategy for contacting study authors [23].

Risk of bias (RoB) in individual studies
As planned [23], RoB was not assessed in stage one. For stage two, RoB was assessed independently in duplicate (KK, JM) using the COSMIN Checklist [32] and extended tool for measurement instruments [21]. For each study of a measurement property, RoB was rated as "Very good", "Adequate", "Doubtful", or "Inadequate" and the overall rating was determined using the worst score counts principle [21,32]. Studies using hypothesis testing approaches that did not define a hypothesis, or for which no hypothesis could be derived, were rated as "Inadequate" because of strong potential for selective reporting of analyses and outcomes [33,34]. Disagreements were discussed, and if consensus was not achieved, a third author (AR) was consulted.
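The worst score counts principle can be illustrated with a minimal Python sketch. It assumes that each checklist item for a study has already been given one of the four ratings named above; the `RoB` enum, the `overall_rob` function and the example ratings are hypothetical names and values introduced only for illustration, not part of the COSMIN tools.

```python
from enum import IntEnum


class RoB(IntEnum):
    """The four risk-of-bias ratings, ordered from worst (0) to best (3)."""
    INADEQUATE = 0
    DOUBTFUL = 1
    ADEQUATE = 2
    VERY_GOOD = 3


def overall_rob(item_ratings):
    """Return the worst (lowest) rating among the checklist item ratings."""
    if not item_ratings:
        raise ValueError("at least one item rating is required")
    return min(item_ratings)


# Hypothetical item ratings for one study of one measurement property.
example = [RoB.VERY_GOOD, RoB.ADEQUATE, RoB.DOUBTFUL, RoB.VERY_GOOD]
print(overall_rob(example).name)  # DOUBTFUL
```

Under this rule a single low-rated item determines the overall rating, however many other items are rated "Very good".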

Data synthesis
For stage one, PROMs and physical measures were categorized according to established frameworks of physical functioning. PROMs were categorized according to the IMMPACT/OMERACT (Initiative on Methods, Measurement, and Pain Assessment in Clinical Trials, Outcome Measures in Rheumatoid Arthritis Clinical Trials) framework: general, site-specific, disease-specific, pain-related physical functioning/activities or activities of daily living [14]. Physical measures were categorized according to level-two categories of the International Classification of Functioning, Disability and Health (ICF) [15]. For stage two, results of each measurement property study were rated as sufficient (+), insufficient (-) or indeterminant (?) against criteria for good measurement properties (S2 Appendix) [20]. Studies using a criterion approach were considered studies of criterion validity or responsiveness [32]. Studies using hypothesis testing approaches that did not define a hypothesis, or for which no hypothesis could be derived, were rated as indeterminant. Standards for assessing a priori hypotheses in hypothesis testing approaches have been removed from updated COSMIN guidelines, with recommendations that the systematic review team formulate hypotheses to evaluate results [20,32]. However, in most studies a hypothesis was not defined or could not be derived. Formulating post-hoc hypotheses on behalf of the authors of included studies would have elevated RoB in this systematic review to an unacceptable level, because lack of a priori hypotheses introduces threats to the internal validity of included studies. This systematic review would therefore have provided an inaccurate representation of the quality of the literature [35,36].
High heterogeneity and RoB directed a qualitative synthesis, in accordance with the a priori protocol [23]. Summarized results were rated as sufficient or insufficient if at least 75% of individual studies were rated as sufficient or insufficient, indeterminant if at least 75% of individual studies were rated as indeterminant, or inconsistent (±) if less than 75% of the individual studies agreed. Information related to interpretability and feasibility is described, in accordance with COSMIN recommendations [20].
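The 75% rule for summarizing results can be sketched in a few lines of Python. This is a minimal illustration that assumes each individual study result is coded as "+", "-" or "?"; the function name `summarize_ratings` and the example inputs are hypothetical.

```python
from collections import Counter

VALID = ("+", "-", "?")


def summarize_ratings(ratings, threshold=0.75):
    """Pool per-study ratings into one summary rating using the 75% rule."""
    if not ratings:
        raise ValueError("at least one study rating is required")
    if set(ratings) - set(VALID):
        raise ValueError("ratings must be '+', '-' or '?'")
    counts = Counter(ratings)
    for symbol in VALID:
        # A rating becomes the summary if at least 75% of studies share it.
        if counts[symbol] / len(ratings) >= threshold:
            return symbol
    return "±"  # fewer than 75% of studies agree: inconsistent


print(summarize_ratings(["+", "+", "+", "-"]))  # +
print(summarize_ratings(["+", "-"]))            # ±
```

With a single study the summary rating is simply that study's rating, which is the situation for many of the physical measures reported in stage two.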

Reporting biases
Assessment of reporting bias was conducted through evaluating consistency between published results and study protocols, if identified in stage two.

Overall quality of evidence
Quality of evidence was evaluated for each measurement property per physical measure, using GRADE (Grading of Recommendations Assessment, Development and Evaluation) modified for measurement properties [20]. Four factors contributed to determining quality of evidence (RoB, inconsistency, imprecision, indirectness). Two reviewers (KK, JM) independently determined quality of evidence and disagreements were resolved through discussion. If consensus was not achieved, a third author (AR) was consulted.
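How the four factors might combine into one level of evidence can be sketched as follows. This is a rough illustration only: it assumes evidence starts at "high" and drops one level per downgrade point, floored at "very low". The number of points assigned to each factor is an assumption of this sketch and is not stated in this review; the function name `grade_quality` is hypothetical.

```python
LEVELS = ["very low", "low", "moderate", "high"]


def grade_quality(rob=0, inconsistency=0, imprecision=0, indirectness=0):
    """Start at 'high' and drop one level per downgrade point (assumed rule)."""
    downgrades = (rob, inconsistency, imprecision, indirectness)
    if any(points < 0 for points in downgrades):
        raise ValueError("downgrade points must be non-negative")
    # Never fall below the lowest level, however many points are deducted.
    index = max(0, len(LEVELS) - 1 - sum(downgrades))
    return LEVELS[index]


print(grade_quality(imprecision=1))          # moderate
print(grade_quality(rob=2, imprecision=2))   # very low
```

The floor at "very low" means that serious concerns on one factor alone can exhaust the scale, which is consistent with the predominance of very low and low ratings reported in the Results.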

Results
The PRISMA flow diagram (Fig 1) shows both stages of searching, selection and reasons for exclusion (S3 Appendix). For stage one, complete agreement was achieved between reviewers. For stage two, there was strong agreement between reviewers for title/abstract screening (κ = 0.85) and full text review (κ = 0.94) [37]. Complete agreement on eligibility was achieved through discussion. Due to unclear reporting, the third reviewer (AR) was consulted once about one study [38] to agree which measurement property was investigated.
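For readers unfamiliar with the κ statistic, the following Python sketch computes Cohen's kappa for two reviewers' screening decisions. It assumes the unweighted two-rater form of the statistic; the decision lists are invented for illustration and are not data from this review.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters judging the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must rate the same non-empty item set")
    n = len(rater_a)
    # Proportion of items on which the raters made the same decision.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected by chance from each rater's marginal frequencies.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Hypothetical decisions for 10 records (not data from this review).
reviewer_1 = ["include"] * 4 + ["exclude"] * 6
reviewer_2 = ["include"] * 3 + ["exclude"] * 7
print(round(cohens_kappa(reviewer_1, reviewer_2), 2))  # 0.78
```

In this example the reviewers agree on 9 of 10 records, but chance agreement of 0.54 reduces κ to 0.78, which shows why κ is reported in preference to raw percentage agreement.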

Stage one: Identify physical functioning outcome measures
Study characteristics. Stage one included 1,101 reports, published across 47 countries over 40 years (1982-2022) with increasing annual publications (S4 Appendix). The age of lumbar spinal surgery populations rose from early 40s in the 1980s to early 60s in the 2010s. For studies that reported sex/gender, reporting was binary and there was approximately equal representation of male/men and female/women. Common surgical procedures included fusion, decompression and discectomy. Physical functioning was assessed using PROMs (n = 964 reports) and physical measures comprising impairments (n = 92 reports), performance (n = 198 reports) and activity in a natural environment (n = 42 reports; S4 Appendix). Most reports included >1 measure of physical functioning.
Results of synthesis. PROMs. Seventy established PROMs were identified and authors from 49 articles developed 67 of their own PROMs in the following physical functioning categories [14] (S5 Appendix):
• General (Established: n = 31, 44%; Developed: n = 41, 61%), including subcategories: physical activity, health status/quality of life, functional status, disability, patient-identified functional limitations
The most frequently used PROM to evaluate physical functioning was the Short Form Health Survey (e.g., SF-36, SF-12) physical component score and physical functioning domain (S4 Appendix). The Roland Morris Disability Questionnaire was the second most frequently used PROM with several versions and modifications (24-item, 23-item, for sciatica, substitute 'leg pain' for 'back pain', modifications not further specified). The importance of walking in the lumbar spinal surgery population is highlighted by 23 different stand-alone question/response option combinations to measure walking capacity.
Physical measures. A total of 134 physical measures were identified comprising impairments (n = 35, 26%), performance (n = 77, 58%) and activity in a natural environment (n = 22, 16%; S4 and S5 Appendices). Physical measures (specific measure of a physical assessment, e.g., lumbar flexion range of movement) were categorized into 17 physical outcome measures (broad outcome that is measured by physical assessment, e.g., range of movement) and mapped to 15 level-two categories in ICF components Body Function and Activities/Participation [15]:
ICF Body Function component
• Mobility of joint functions (physical outcome measure: range of movement)
• Muscle power functions (physical outcome measure: strength)
• Control of voluntary movement functions (physical outcome measure: motor control)
• Gait pattern functions (physical outcome measure: gait parameters)
• Exercise tolerance functions (physical outcome measure: aerobic capacity)
ICF Activities/Participation component
• Maintaining body position (physical outcome measure: sustained positions)
• Lifting and carrying objects (physical outcome measure: lifting)
• Hand and arm use (physical outcome measure: reaching)
• Walking (physical outcome measures: walking, composite gait measure)
• Going up and down stairs (physical outcome measure: stairs)
• Moving around in different locations (physical outcome measure: walking)
• Other specified mobility (physical outcome measures: multi-activity performance-based measures, functional task performance, physical activity parameters)
The most frequently used physical outcome measure was range of movement for impairment-based, walking tests for performance-based and physical activity parameters for activity in a natural environment. Similar to PROMs, the importance of walking is highlighted by 21 different walking physical measures (e.g., 6-minute walk test) across performance-based and activity in a natural environment physical outcome measures.

Stage two: Assess measurement properties of physical measures
Study characteristics. Stage two included 43 articles (Table 1) evaluating measurement properties of 34 physical measures of impairments (n = 8), performance (n = 18) and activity in a natural environment (n = 8). Studies were published across 13 countries over 24 years. The total number of participants was 4,619 and sample size ranged from 8 to 375 (median = 50). Mean age of participants was 57 years (range: 27-91). Reporting of sex versus gender was variable (70% sex, 30% gender) and binary with slightly more men/males (54%). Surgeries included discectomy, decompression and fusion for lumbar disc herniation, spinal stenosis, degenerative disc disease and spondylolisthesis.
RoB in individual studies. Overall RoB in individual studies was rated as inadequate (78%, n = 76), doubtful (18%, n = 17) or very good (4%, n = 4; Tables 2-4). Key issues included lack of a priori hypotheses for hypothesis testing approaches and reporting measurement properties for comparator instruments (S6 Appendix). Agreement between reviewers in RoB assessment before discussion was κ = 0.81, and complete agreement was achieved through discussion.
Measurement property results per physical measure. Investigations included assessments of validity (n = 22 physical measures; Table 2), responsiveness (n = 20 physical measures; Table 3), reliability (n = 8 physical measures; Table 4) and measurement error (n = 6 physical measures; Table 4). The strongest evidence was moderate-level, supporting sufficient responsiveness of 1-minute stair climb and 50-foot walk tests, insufficient responsiveness of the 5-minute walk test, and sufficient reliability of distance walked during the 6-minute walk test. Very low to low-level evidence limits further understanding of measurement properties. Measurement properties for all physical measures are summarized in Table 5. S6 Appendix details individual studies of measurement properties per physical measure and overall quality of evidence.
Impairment-based physical outcome measures. Active range of movement: very low-level evidence supports indeterminant construct validity and responsiveness (construct approach) of computer assisted electronic inclinometer measures of lumbar, trunk and hip flexion and extension, and dual bubble inclinometer measures of lumbopelvic flexion and extension [40,44,45]. Very low-level evidence supports indeterminant responsiveness (construct approach) of the Schober test [40].
Gait parameters: low-level evidence supports indeterminant construct validity and sufficient criterion validity of the Two-step test [39]. Very low-level evidence supports indeterminant responsiveness (construct approach) for asymmetry of double support and stride length using a wireless gait analysis system [31].
50-foot walk test: moderate-level evidence supports sufficient responsiveness (construct approach) of the 50-foot walk test [53].
Modified Sorensen test: very low-level evidence supports indeterminant construct validity of the modified Sorensen test [47,48].
Self-paced walking test: very low-level evidence supports indeterminant responsiveness (construct approach) of distance and time walked during the self-paced walking test [68].
Treadmill test: for maximum walking time and time to first symptoms during a treadmill test, very low-level evidence supports sufficient pre-operative test-retest reliability and indeterminant post-operative test-retest reliability [49]. Low to very low-level evidence supports indeterminant construct validity for maximum walking distance, maximum walking time, time to first symptoms and distance to first symptoms during a treadmill test [52,58].
Trunk muscle endurance: very low-level evidence supports indeterminant responsiveness (construct approach) of repetitive arch ups and sit ups until exhaustion [40].
Activity in a natural environment physical outcome measures. Step counts: very low-level evidence supports indeterminant construct validity and responsiveness (construct approach) of steps per day using a Fitbit (Flex 2, Charge, Zip) [71,74,75] and Mi Band [78]. At the thigh, very low-level evidence supports inconsistent criterion validity of step detection (sufficient for ActivPAL3 and Jawbone UP Move, insufficient for Fitbit Flex) [73]. At the wrist, very low-level evidence supports insufficient criterion validity of step detection using a Fitbit Flex and Jawbone UP Move [73]. At the thigh and wrist, very low-level evidence supports indeterminant measurement error of step detection using Fitbit Flex, Jawbone UP Move and ActivPAL3 (thigh only) [73].
Gait Posture Index: very low-level evidence supports indeterminant responsiveness (construct approach) of the Gait Posture Index using personal electronic devices (e.g., Garmin) or Mi Band 2 [72,76].
Distance walked per day: very low-level evidence supports indeterminant construct validity (using personal smartphone) [79] and responsiveness (construct approach, using Fitbit Zip) [75] of distance walked per day.
Gait cycles: very low-level evidence supports indeterminant construct validity of number of gait cycles per day, gait cycles per hour and gait intensities per day using the StepWatch3 [77].
Reporting biases. No study protocols of measurement properties were identified in stage two. Selective reporting of results was considered and reported within RoB assessment.

Discussion
Using a rigorous two-staged approach, this systematic review is the first to identify outcome measures (PROMs, physical) used to evaluate physical functioning in the lumbar spinal surgery population and assess measurement properties of the physical measures. Stage one generated a comprehensive list of PROMs (Established n = 70, Developed n = 67) and physical measures (n = 134). However, only 34 physical measures had investigations of measurement properties. Moderate-level evidence supported sufficient responsiveness of the 1-minute stair climb and 50-foot walk tests, insufficient responsiveness of a 5-minute walk test and sufficient reliability of distance walked during the 6-minute walk test. Very low to low-level evidence limits further understanding of measurement properties for a wide range of physical measures.

Stage one
The global importance of physical functioning in lumbar spinal surgery is emphasized by the breadth of countries represented in stage one and increasing number of publications across 40 years.This aligns with physical functioning advocated as a critical domain within core outcome sets for the past 25 years [8,80,81], and measurement instruments recommended for use within core outcome sets including well-established PROMs (ODI, Roland Morris Disability Questionnaire) [8,80,81].However, stage one identified an extensive number of PROMs across all five categories of physical functioning [14] and physical Active range of movement was included in 2 studies [44,45] (Inadequate RoB) evaluating construct validity of 2 physical measures.Computer assisted electronic inclinometer measures of lumbar spine, trunk and hip flexion and extension was compared to the Roland Morris Disability Questionnaire 1-2 days pre-operatively and 2 months post-operatively [44].Dual bubble inclinometer measures of lumbopelvic flexion and extension was compared to the North American Spine Society Questionnaire (Disability and neurogenic symptom subscales) and a straight leg raise at Physical Therapy pre-operatively, first visit post-operatively and discharge [45].
Two studies [40,44] (Inadequate RoB) evaluated responsiveness (construct approach) of 3 physical measures.Change in computer assisted electronic inclinometer measures of lumbar spine, trunk and hip flexion and extension was compared to change in Roland Morris Disability Questionnaire scores collected 1-2 days preoperatively and 2 months post-operatively [44].Change in Dualer goniometer measures of lumbar extension and the Schober test of lumbar flexion was compared to change in 15D health-related quality of life PROM collected 2 and 14 months post-operatively [40].

Physical outcome measure: Gait parameters
Physical measure: Two-step test Construct validity [39] ?Low Criterion validity [39] + Low Physical measure: Asymmetry of double support Responsiveness (Construct approach) [31] ?Very low Physical measure: Stride length Responsiveness (Construct approach) [31] ?Very low The Two-step test was included in 1 study [39] evaluating construct validity (Inadequate RoB) and criterion validity (Doubtful RoB).For both construct and criterion validity, Two-step test results were compared to the TUG test 1 day pre-operatively.Asymmetry of double support and stride length were included in 1 study [31] (Inadequate RoB) evaluating responsiveness (construct approach).Change in RehabGait system physical measures of asymmetry of double support and stride length was compared to change in ODI collected 1 day pre-operatively and post-operatively (10 weeks, 12 months).

Physical outcome measure: 1-min stair climb
Physical measure: Number of stairs Responsiveness (Construct approach) [53] + Moderate The 1-minute stair climb test was included in 1 study [53] (Very good RoB) evaluating responsiveness (construct approach).Change in the number of stairs climbed in 1 minute was compared to change in global perceived effect (construct-specific, general), physical measures (5-minute walk, 50-foot walk, TUG) and PROMs (ODI, back pain) collected 8-12 weeks pre-operatively and 6 months post-operatively.
Two studies [60,61] (Inadequate RoB) evaluated measurement error.Time to complete the 5 repetition sit to stand test was compared using test-retest in a clinical environment [60] and inter-rater agreement of a tele-supervised test performed at home [61].

Physical outcome measure: 5-min walk test
Physical measure: Distance walked Responsiveness (Construct approach) [53] -Moderate The 5-minute walk test was included in one study [53] (Very good RoB) evaluating responsiveness (construct approach).Change in the distance walked was compared to change in global perceived effect (construct-specific, general), physical measures (1-minute stair climb, 50-foot walk, TUG) and PROMs (ODI, back pain) collected 8-12 weeks pre-operatively and 6 months post-operatively.
Using the 6WT app, test-retest reliability and measurement error were evaluated by comparing repeated measures pre-operatively to 6 weeks post-operatively.One study [38] (Inadequate RoB) evaluated responsiveness (criterion and construct approaches).For criterion approach, change in distance to first symptoms measured using the 6WT app was compared to change in the Zurich Claudication Questionnaire satisfaction scale collected pre-operatively and 6 weeks post-operatively.For construct approach, distance to first symptoms measured using the 6WT app pre-operatively was compared to distance to first symptoms 6 weeks post-operatively.Responsiveness (Construct approach) [38] ?Very low Time to first symptoms during the 6-minute walk test was included in 1 study [38] (Inadequate RoB) evaluating construct validity.Using the 6WT app, time to first symptoms was compared to 5 PROMs pre-operatively and 6 weeks post-operatively.One study [38] (Doubtful RoB) evaluated reliability and measurement error.Using the 6WT app, test-retest reliability and measurement error were evaluated by comparing repeated measures pre-operatively to 6 weeks post-operatively.One study [38] (Inadequate RoB) evaluated responsiveness (criterion and construct approaches).For criterion approach, change in time to first symptoms measured using the 6WT app was compared to change in the Zurich Claudication Questionnaire satisfaction scale collected pre-operatively and 6 weeks post-operatively.For construct approach, time to first symptoms measured using the 6WT app pre-operatively was compared to time to first symptoms 6 weeks post-operatively.Walking speed during the 10-meter walk test was included in 1 study [69] (Inadequate RoB) evaluating construct validity.Walking speed was compared to the Pain Catastrophizing Scale pre-operatively and post-operatively at 3, 6 and 12 months.One study [69] (Inadequate RoB) evaluated responsiveness (Construct approach).Change in walking speed was compared to change in 
scores on the Pain Catastrophizing Scale collected pre-operatively between admission and surgery and postoperatively at 12 months.

Physical outcome measure: 50-foot walk test
Physical measure: Time to complete Responsiveness (Construct approach) [53] + Moderate The 50-foot walk test was included in 1 study [53] (Very good RoB) evaluating responsiveness (construct approach).Change in the time to complete was compared to change in global perceived effect (construct-specific, general), physical measures (1-minute stair climb, 5-minute walk, TUG) and PROMs (ODI, back pain) collected 8-12 weeks pre-operatively and 6 months post-operatively.
[56]. Three studies [51,53,56] (2 Inadequate, 1 Very good RoB) evaluated responsiveness (construct approach) and one study [56] (Doubtful RoB) evaluated responsiveness (criterion approach). For construct approach, change in time to complete the TUG was compared to change in 7 PROMs, global perceived effect (construct-specific, generic) and physical measures (1-minute stair climb, 5-minute walk, 50-foot walk) collected pre-operatively and post-operatively (3 days [51], 6 weeks [51,56], 6 months [53]). Time to complete the TUG pre-operatively was also compared to time to complete 6 weeks post-operatively [56]. For criterion approach [56], change in time to complete the TUG was compared to change in the Zurich Claudication Questionnaire satisfaction scale collected pre-operatively and 6 weeks post-operatively.

measures across 15 level-two ICF categories [15]. Use of a range of measures aligns with previous systematic reviews highlighting limited and inconsistent implementation of recommendations for standardizing outcome measures in LBP clinical trials [82], and substantiates physical functioning as a multidimensional construct not best measured with a single PROM. Support for PROMs and physical measures to evaluate physical functioning in other musculoskeletal disease is strong, including international recommendations within clinical trials [17,19,83]. In lumbar spinal surgery, there is emerging evidence demonstrating the value of physical measures (important to patients [59], responsive to change [38,53,56,59,66], predictive of outcomes [84]). However, recommendations for their use in LBP populations do not exist. Establishing consensus on appropriate physical measures of physical functioning is required to enable comparisons of interventions and outcomes. Stage one results highlight an illustrative example. Walking was frequently evaluated with PROMs and physical measures, aligning with previous research emphasizing walking as an important component of rehabilitation in lumbar spinal surgery [85][86][87][88]. However, with 23 different stand-alone PROM questions/response options to measure walking capacity and 21 different walking physical measures, it is nearly impossible to compare across studies. Action is required to standardize measurement of physical functioning outcomes.

Steps detected at the thigh and wrist were included in 1 study [73] (Doubtful RoB) evaluating criterion validity using ActivPAL3 (thigh only), Fitbit Flex and Jawbone UP Move. Steps detected by the activity monitors were compared to observed step count on the second or third day post-operatively. One study [73] (Doubtful RoB) evaluated measurement error using ActivPAL3 (thigh only), Fitbit Flex and Jawbone UP Move. Steps detected were compared to observed step count on the second or third day post-operatively.
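
A comparison of device-detected steps with an observed step count, as described above, is often summarized with a mean difference (bias) and 95% limits of agreement (LoA). The following is a minimal sketch of that calculation only; it is not the analysis reported in [73]. The function name and the step counts are hypothetical and invented for illustration.

```python
from statistics import mean, stdev


def limits_of_agreement(device_steps, observed_steps, z=1.96):
    """Return (bias, lower LoA, upper LoA) for device minus observed steps.

    Assumes paired measurements and approximately normally distributed
    differences; z = 1.96 gives the conventional 95% limits.
    """
    if len(device_steps) != len(observed_steps) or len(device_steps) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [d - o for d, o in zip(device_steps, observed_steps)]
    bias = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    return bias, bias - z * sd, bias + z * sd


# Hypothetical step counts, invented for illustration only
observed = [100, 120, 80, 150, 90]
device = [90, 115, 70, 140, 85]
bias, lower, upper = limits_of_agreement(device, observed)
print(f"bias = {bias:.1f} steps, 95% LoA = {lower:.1f} to {upper:.1f}")
```

A negative bias in this sketch indicates that the device undercounts relative to observation; whether the width of the limits is acceptable depends on the clinical context.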

Stage two
Sufficient measurement properties are required for recommending measures to use in research and clinical practice [89,90]. While 134 physical measures were identified in stage one, only 34 had investigations of measurement properties. Few recommendations can be made because of indeterminate measurement properties and mostly very low to low-level evidence. The strongest evidence supports performance-based measures, with moderate-level evidence for sufficient responsiveness of 1-minute stair climb and 50-foot walk tests, insufficient responsiveness of the 5-minute walk test and sufficient reliability of distance walked during the 6-minute walk test. Very low to low-level evidence limits further understanding of measurement properties. However, results illustrate promise for a range of physical measures, for example when hypotheses are appropriately defined for construct approaches to validity and responsiveness [53,65] or no hypothesis is required to evaluate the measurement property [38,39,49,55,56,60,61,66,67,73] according to COSMIN [20]. While emerging evidence supports the value of physical measures, at present few clear recommendations can be made. Prospective, low risk of bias studies are required.
A challenge in conducting this systematic review was inconsistent terminology and poor reporting in included studies. Inconsistent terminology led to lack of clarity in the measurement property under investigation. For example, three studies [38,55,65] reported an investigation of content validity; however, study designs and statistical methodologies aligned with the COSMIN definition of hypothesis testing for construct validity. Poor reporting strongly contributed to indeterminate measurement properties, high RoB and very low to low-level evidence. It also precluded following the COSMIN recommendation to derive hypotheses on behalf of study authors when a priori hypotheses were absent in hypothesis testing approaches, because there was insufficient information about the expected direction and strength of associations. COSMIN guidelines were designed to enable systematic reviews but can also be used to inform terminology and reporting of studies on measurement properties. However, this is not common, as inconsistent terminology and poor reporting are challenges identified in other systematic reviews investigating measurement properties of physical measures [91,92]. COSMIN has published reporting guidelines for studies on measurement properties [93]; however, lessons learned from Consolidated Standards of Reporting Trials (CONSORT) suggest at least 15 years may be required for widespread implementation [94]. PRISMA-COSMIN reporting guidelines for systematic reviews of measurement properties [95] will hopefully hasten the process to draw clear conclusions about measurement properties of physical measures.
This systematic review demonstrates expanding use of digital technologies in remote monitoring of physical functioning. Several digital technologies were used (personal devices [38,55,56,59,61,65,72,76], low-cost consumer-grade [71][72][73][74][75][76][77][78] and research-grade wearables [31,73]) to measure impairments, performance and activity in a natural environment, with unique combinations in some studies. For example, free smartphone applications enabled digital self-assessments of performance-based measures (e.g., 6-minute walk) in a natural environment, but with limited and very low to low-level evidence supporting measurement properties [38,56,59,65]. However, one unique study of patient-reported responsiveness indicated that patients perceived a smartphone application measuring 6-minute walk performance to be better at detecting changes in their symptoms than PROMs (very low-level evidence) [59]. In selecting appropriate digital technologies in lumbar spinal surgery, important factors to consider are wear position and gait aids, as both influence validity of step count detection (very low-level evidence) [73]. Digital technologies are promising solutions to enable personalized remote monitoring, aid clinical reasoning and inform tailored interventions [96]; however, their measurement properties are largely unknown, necessitating low RoB studies.

Strengths and limitations
This robust systematic review used a rigorous two-staged approach to identify physical functioning outcome measures, enabling a comprehensive search for measurement properties of physical measures of physical functioning. It is limited by heterogeneity in physical measures, indeterminate measurement properties and RoB across included studies. Poor reporting of studies was a key issue. Important findings may have been missed with the exclusion of non-English full texts (108 in stage one, 1 in stage two). Limitations prevent clear recommendations for a range of physical measures of physical functioning in the lumbar spinal surgery population.

Conclusions
While many physical measures are used to evaluate physical functioning in the lumbar spinal surgery population, few have investigations of measurement properties. Research to date is of low quality overall, consisting of high RoB studies with inconsistent use of terminology and poor reporting. The strongest evidence supports performance-based measures, with moderate-level evidence for sufficient responsiveness of the 1-minute stair climb and 50-foot walk tests, insufficient responsiveness of the 5-minute walk test and sufficient reliability of distance walked during the 6-minute walk test. There is promise for physical measures of physical functioning to demonstrate sufficient measurement properties, but few clear recommendations can be made. Knowledge of measurement properties is essential in establishing consensus on appropriate physical measures of physical functioning to evaluate effectiveness of interventions for lumbar spinal surgery populations and inform clinical practice. Prospective low RoB studies are required owing to emerging evidence demonstrating the value of physical measures.

• Muscle endurance functions (physical outcome measure: muscle endurance)
• Involuntary movement reaction functions (physical outcome measure: balance)
ICF Activities/Participation component
• Changing basic body position (physical outcome measure: functional mobility)

Natural environment physical outcome measures
Physical outcome measure: Step count
[49]n et al., 2000)d (NR) "most patients completed a full 15-minute examination, and there was little variability" (Deen et al., 2000) [49].
CI, confidence interval; ICC, intraclass correlation coefficient; LoA, limits of agreement; MIC, minimum important change; mph, miles per hour; NR, not reported; SEM, standard error of measurement.

Physical outcome measure: Treadmill test
[49]s, radiological measures (degenerative findings, minimum area of dural sac, thecal sac cross-sectional area), and treadmill test time to first symptoms pre-operatively [58] and post-operatively at 6 months [58] and 11 years [52]. Treadmill test maximum walking time, time to first symptoms and distance to first symptoms were compared to thecal sac cross-sectional area pre-operatively and 6 months post-operatively [58]. One study [49] (Inadequate RoB) evaluated reliability. Test-retest reliability of treadmill test maximum walking time and time to first symptoms was evaluated pre-operatively and 6 months post-operatively.
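
Test-retest reliability of a continuous measure such as walking time is commonly expressed as an intraclass correlation coefficient (ICC), from which a standard error of measurement (SEM) can be derived. The sketch below is illustrative only and assumes a two-way random-effects, absolute-agreement, single-measurement model, ICC(2,1); the ICC model used in the included study is not stated here and may differ. The function names and walking times are hypothetical.

```python
import math


def icc_2_1(scores):
    """ICC(2,1) from a list of rows, one row per participant,
    each row holding k repeated measurements (e.g., test and retest)."""
    n = len(scores)
    k = len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Two-way ANOVA sums of squares
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)             # between-participant mean square
    msc = ss_cols / (k - 1)             # between-session mean square
    mse = ss_err / ((n - 1) * (k - 1))  # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


def sem_from_icc(sd, icc):
    """One common SEM formula: SD * sqrt(1 - ICC)."""
    return sd * math.sqrt(1.0 - icc)


# Hypothetical test and retest walking times (minutes), for illustration only
times = [[10, 12], [20, 19], [30, 33], [40, 38]]
icc = icc_2_1(times)
print(f"ICC(2,1) = {icc:.3f}")
```

Other SEM formulations exist (for example, based directly on variance components), so the formula in `sem_from_icc` should be read as one option among several.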

Physical outcome measure: Trunk muscle endurance
Inadequate RoB) evaluating responsiveness (construct approach) of 2 physical measures. Change in the number of repetitive arch ups and sit ups until exhaustion was compared to change in 15D health-related quality of life PROM collected 2 and 14 months post-operatively.

Physical outcome measure / physical measure Measurement property Overall Rating a Overall quality of evidence
[77]equate RoB) evaluating responsiveness (construct approach) using participants' personal devices (e.g., Apple Watch, Garmin) or Mi Band 2. Change in Gait Posture Index was compared to change in ODI and patient satisfaction pre-operatively and 3 months post-operatively.

Inadequate RoB) evaluating construct validity using Apple Health activity data from an Apple iOS personal smartphone. Distance walked per day was compared to distance walked during the 6-minute walk test and 3 PROMs pre-operatively and post-operatively at 6 and 12 weeks. One study [75] (Inadequate RoB) evaluated responsiveness (construct approach) of distance walked per day (km / day) using a Fitbit Zip. Change in distance per day was compared to change in 5 PROMs 7 days pre-operatively and post-operatively at 1, 2 and 3 months.

Number of gait cycles per day, gait cycles per hour, and gait intensities (>40 gait cycles per minute) were included in 1 study [77] (Inadequate RoB) evaluating construct validity using the StepWatch 3. The number of gait cycles was compared to 4 PROMs and 4 radiological measures pre-operatively and post-operatively at 3 and 12 months.